Genetic association and linkage studies can provide insights into
complex disease biology, guiding the development of new diagnostic
and therapeutic strategies. Over the past decade, genetic association
studies have largely focused on common, easy-to-measure genetic
variants shared between many individuals. These common variants
typically have subtle functional consequences, and translating the
resulting association signals into biological insights can be
challenging. In the last few years, exome sequencing has emerged as a
cost-effective strategy for extending these studies to include rare
coding variants, which often have more marked functional
consequences. Here, we provide practical guidance on the design and
analysis of complex trait association studies focused on rare coding
variants.



INTRODUCTION

Over the past decade, genome-wide association studies have
identified hundreds of common risk alleles for complex
human diseases (1-9). These studies were enabled by a
combination of the availability of large well-characterized sample
collections (6-8,10-13), advances in genotyping technologies
(14-16) and advances in methods for the analysis of
the resulting data (17-20). These studies have provided
several biological insights, highlighting the role of the
complement genes in age-related macular degeneration (21-23), of
autophagy in Crohn's disease (24-26) or of specific regulatory
proteins in blood lipid levels (6), among others. Still,
translating the resulting signals into function has been challenging
because most common variants have only subtle functional
consequences.

Over the past several years, great advances have been made
in sequencing and capture technologies, enabling accurate
determination of nearly all protein-coding sequence variants in
an individual (27-29). These exome-sequencing technologies
have already accelerated genetic studies of Mendelian
disorders (30) and there is great interest in extending them to
complex traits (31). To support this goal, many methods for
the design, analysis and interpretation of exome-sequencing
studies have been proposed (32-34) and focused candidate
gene-sequencing studies have been undertaken, with
promising results (35-43).

We have been involved in the planning, execution and
analysis of several exome-sequencing studies encompassing
information on >10 000 individuals. In this review, we focus on
the practical aspects of such studies, highlighting important
issues to consider when undertaking or evaluating
exome-sequencing studies to dissect complex trait genetics. Given
the rapidly changing nature of the field, we have tried not to
be prescriptive. Rather, we encourage readers to carefully
consider a series of key questions when evaluating alternatives for
study design, generation of sequence data and variant calling,
quality control of the resulting data, rare variant association
analysis and follow-up approaches (Fig. 1).



STUDY DESIGN: SAMPLE SELECTION

Perhaps the most important step in any exome-sequencing
study is the choice of samples to sequence. As with any
genetic study, we encourage researchers to start by clearly
stating their objectives at the outset (is the objective to
survey the range of variation in normal individuals, to find
variants that predispose to risk of a specific disease, like
diabetes or myocardial infarction, to find variants that influence
a specific quantitative trait, like glucose or lipid levels, or to
simultaneously investigate a wide range of quantitative
outcomes?) and to systematically inventory all samples in
which the traits of interest might be examined (these might
include population samples, case and control series, and
even families that might be segregating Mendelian forms of
disease).



Study design
    Will samples be selected from the phenotypic extremes?
    Will family-based samples be included?
    Will 'convenience' control samples be included?
    Are case and control samples carefully matched by ancestry?
    Will the samples represent one or multiple ethnicities?
    What sequencing coverage will the samples be sequenced at?

Read mapping and variant calling
    Will genotypes be called using reads aligned to off-target regions?
    Has sample tracking been performed?
    Have reads been aligned to the reference genome, base quality
    scores calibrated and duplicate reads removed?
    Have sample quality metrics been evaluated after read mapping?
    Has quality control of variants and samples been performed?

Association analysis
    Have variants been annotated?
    How are variants with different annotations resolved?
    Have single variant association tests been performed?
    What type of burden test to perform?
    What allele frequency threshold to use for gene burden test?
    What variants to include in gene burden test?
    What approach to correct for multiple testing?

Follow-up approaches
    What follow-up approach to perform:
    genotyping, imputation or sequencing?

Figure 1. Key questions and considerations for different stages of an exome sequencing study of complex disease.

Nearly always, the range of potentially informative samples 
exceeds the available sequencing budget. Therefore, careful 
consideration of which samples to sequence will be extremely 
important. In most instances, it will be fruitful to focus on 
samples with an extreme outcome (44-46) — for a quantitative 
trait, these are naturally defined as samples at the extremes of 
the trait distribution after accounting for known modifiers, 
which might include age, sex and diet but also previously 
identified genetic risk factors. For a discrete trait, these are 
samples whose outcomes are 'unusual' after accounting for 
previously known risk factors (46) — for example, individuals 
who present with myocardial infarction at an unusually 
young age. Another general strategy for increasing power is 
to focus on samples whose relatives have similarly extreme 
phenotypes (such as high lipid levels) or a history of disease 
(such as myocardial infarction) (47). 
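
To make the idea of "extremes after accounting for known modifiers" concrete, the following minimal Python sketch regresses a quantitative trait on a single covariate and selects the samples with the most extreme residuals. The function names, the single-covariate model and the toy LDL and age values are assumptions made for illustration only; a real study would adjust for several modifiers at once and would not rely on this simple least-squares fit.

```python
from statistics import mean


def residuals(trait, covariate):
    """Residuals from a least-squares fit of trait on one covariate."""
    mx, my = mean(covariate), mean(trait)
    sxx = sum((x - mx) ** 2 for x in covariate)
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, trait))
    slope = sxy / sxx  # assumes the covariate is not constant
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(covariate, trait)]


def select_extremes(sample_ids, trait, covariate, n_per_tail):
    """Return (low_tail_ids, high_tail_ids) ranked by adjusted trait value."""
    ranked = sorted(zip(residuals(trait, covariate), sample_ids))
    low = [sid for _, sid in ranked[:n_per_tail]]
    high = [sid for _, sid in ranked[-n_per_tail:]]
    return low, high


# Hypothetical toy data: LDL cholesterol (mg/dl) and age (years).
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
age = [30, 40, 50, 60, 70, 80]
ldl = [100, 110, 180, 130, 90, 150]

low, high = select_extremes(ids, ldl, age, n_per_tail=1)
print(low, high)
```

In this toy example, the sample with the highest raw value is also the highest after adjustment, but the ranking of the remaining samples changes once age is taken into account, which is the point of adjusting before selecting.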

Although selecting individuals with phenotypes that appear
extreme or unusual based on known risk factors is important,
other considerations can also greatly impact the outcome of the
study. For example, if a role for de novo mutation events is
suspected, it will be extremely useful to sequence related
individuals (48-50) and, if the identification of individuals who
are homozygous for rare loss-of-function alleles is desired,
sequencing of individuals with evidence of inbreeding will
be appealing (27).

It is expected that many rare variants will have a very 
restricted geographic distribution (51,52) so that careful 
matching of case and control ancestries is likely to be extremely
important. In contrast to genome-wide association studies of
common variants, where methods for removing artifacts due to 
mismatches between case and control ancestries are mature 
(18,53) and the use of 'convenience' control samples is rela- 
tively widespread, we expect that extreme care will be 
needed when using convenience controls in exome-sequencing 
studies because of the potential for false signals to be intro- 
duced by small differences in ancestry. As with genome-wide 
association studies, when these concerns can be overcome, 
convenience controls can provide for greatly increased 
sample sizes and power (54). 

Most protein-coding variants are extremely rare, previously
undescribed and geographically restricted in their segregation
pattern (52,55,56). Often, interesting and informative variants
will segregate in a population-specific manner. For example,
Y142X, a nonsense variant in PCSK9, demonstrates that
knockout of the gene results in greatly reduced low-density
lipoprotein cholesterol levels and decreased coronary heart
disease risk; it has a frequency of 0.8% in African-ancestry
individuals but is virtually absent from European-ancestry samples
(44). For this reason, the most complete exome-sequencing
studies will examine individuals from a variety of ancestries,
with the expectation that segregating variants will
provide insights about different (but potentially overlapping)
subsets of genes in each population. In this context, founder
populations, where it may be possible to observe multiple
copies of alleles that are otherwise extremely rare, may
prove very useful for exome-sequencing studies [just as they
were for earlier studies of Mendelian disorders (57,58)].



STUDY DESIGN: SEQUENCING STRATEGY 

Standards for generation of high-quality exome sequence data
are rapidly emerging. There are several good summaries of
raw data quality, but it is common to aim for coverage with
high-quality bases to reach 20× or greater in 80-95% of
the protein-coding sequences in each genome, after removal
of ambiguously mapped reads and of duplicated reads
(4,55). With this level of coverage, it should be possible to
identify the vast majority of protein-coding variants with
high specificity (55). Because the efficiency of enrichment
protocols exhibits great local variation, achieving this level
of coverage requires sequencing the protein-coding regions
of each individual to an average depth of 60-80×.
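
The distinction between average depth and the fraction of the target reaching a minimum depth can be illustrated with a short Python sketch. The per-base depth list is a hypothetical toy input (in practice it would be derived from the aligned reads after duplicate and mapping-quality filtering), and the function names are ours.

```python
def fraction_covered(depths, min_depth=20):
    """Fraction of target bases with depth >= min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


def mean_depth(depths):
    """Average depth across all target bases."""
    return sum(depths) / len(depths)


# Hypothetical per-base depths for ten target positions in one sample.
depths = [5, 18, 22, 40, 75, 90, 120, 64, 33, 21]

breadth = fraction_covered(depths, min_depth=20)
average = mean_depth(depths)
print(f"mean depth {average:.1f}x, {breadth:.0%} of bases at >= 20x")
```

Here the average depth is well above 20× even though a fifth of the positions fall below it, which mirrors why a much higher average depth is needed to reach the target breadth when enrichment efficiency varies locally.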

Most protocols for targeted exome sequencing also result in
relatively light coverage of the rest of the genome, typically in
the range of 0.2-2.0× on average. Although these 'off-target'
reads are sometimes discarded in analyses, in our view, they
can be extremely useful, particularly in samples that have
not been genotyped with whole genome arrays. These
off-target reads can be used to estimate the local or global ancestry
of each sample (enabling improved case-control matching in
association analyses or admixture mapping analyses), can be
combined with a panel of reference haplotypes to estimate
genotypes across the genome (59-61) and can facilitate
detection of large structural variants (such as deletions of entire
genes) (62).

VARIANT CALLING 

Once sequence data are generated, there are several steps 
required to process raw short read sequences into high-quality 
genotypes for each individual. Typically, we first check 
whether DNA samples have been contaminated and, if DNA 
fingerprints are available, also check whether samples were 
tracked correctly during processing (63,64). Next, the 
process proceeds to the alignment of short sequence reads to 
the reference genome (65-67), calibration of base-quality 
scores (68) and removal of duplicate reads (69). After this 
initial processing, it is useful to examine per-sample quality
metrics, which might include the fraction of the exome
covered at various depths (after removal of duplicates and
poorly mapped reads), the distribution of empirical
base quality scores and the relationship between coverage
and GC content. Data for samples with outlier properties,
such as a low fraction of the exome covered or low base
quality scores, can be excluded, flagged and/or reprocessed.

After this step, the reads overlapping each position are 
inspected to identify variant sites. Typically, these sites will 
be covered by many reads that differ from the reference 
genome (68,70). The initial list of variant sites is then 
inspected by a machine-learning-based classifier that tries to 
separate variants likely to be polymorphic from those that 
might be calling artifacts (lists of known variants and 
common artifacts generated by the 1000 Genomes Project 
can often be used to train these classifiers) (4,68,71). To
distinguish true and false positive variants, the machine-learning
classifiers typically evaluate metrics like the mapping
quality of reads supporting each allele, the fraction of reads 
supporting the alternate allele in putative heterozygotes and 
sequencing depth. In very small data sets, it may not be prac- 
tical to tune machine-learning-based classifiers, and it may be 
necessary to manually review each of these quality metrics to 
determine appropriate quality cut-offs for each quantity (31). 
Note that, while variant calls can be generated across the 
entire genome, producing accurate genotypes in regions that 
are not deeply covered typically requires an additional post- 
processing step — using a haplotype aware genotype caller 
(59,72,73). These haplotype aware callers are quite useful 
for variants shared across many individuals but are not 
useful for the rarest variants (including private variants). We 
also note that calling of insertion-deletion polymorphisms 
remains especially challenging and that improved analysis of 
these important variants will likely require a new generation 
of sequence analysis tools. 

At this stage in the process, it is again common to generate a 
series of quality metrics — these might include the number of 
variants per individual (typically, we expect 10 000-12 500 
synonymous variants, 9500-12 000 non-synonymous variants 
and 100-200 stop or splice altering variants per individual), 
the fraction of variants in each category that is unique to 
each sequenced sample (typically, we expect that nearly all 
the variants in each sample have been previously described), 
the fraction of heterozygous sites per sample and the fraction 
of coding indels that result in a frameshift. Samples with 
unusual profiles can be flagged, reprocessed or excluded 
from downstream analyses (55). Within each of these
categories, it is also common to compare the transition-transversion
ratio of new and previously described variants (74). The
transition-transversion ratio is a useful diagnostic metric because,
in nature, transitions (A ↔ G and C ↔ T) occur much
more often than transversions (A ↔ C, A ↔ T, G ↔ C
or G ↔ T). For the exome, we expect the ratio
to be a little above 2.0 for non-synonymous variants and
above 5.0 for synonymous variants (55,71). It is often a
good idea to manually review the evidence supporting a
random subset of the sites, for example using the Integrative
Genomics Viewer (75,76), and this review should always be
carried out for the key variants supporting a manuscript or
novel finding. If sufficient resources are available, genotyping
or Sanger sequencing of putative carriers can validate a subset
of newly identified variants.
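
The transition-transversion ratio described above is simple to compute from a list of single nucleotide variants. The sketch below is a minimal Python illustration; the input format, a list of (reference, alternate) base pairs, and the function name are assumptions, and real pipelines would read these alleles from the variant call file and compute the ratio separately for each annotation category.

```python
BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) bases."""
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single nucleotide variant: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Hypothetical toy call set: four transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(calls))
```

A ratio well below the expected values in a set of newly discovered variants suggests an excess of calling artifacts, since random errors produce transversions about twice as often as transitions.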

Although it is not yet standard to do so, we recommend that,
for each position, the depth of coverage with high-quality
bases and the fraction of samples reaching coverage of 20× or
greater should also be recorded. These quantities
facilitate comparisons between exome-sequencing
studies, helping distinguish regions where one study found
variation and another study had poor coverage from regions
where there truly are differences in the rate of variation
across studies.

While there are many reasonable choices for these steps
(ranging from the choice of read mapper and specific criteria for
filtering poorly mapped reads to criteria for declaring variant
calls to be high quality), we note that these choices, just
like choices of sequencing and exome capture technology
and protocol, do have a small impact on results and can make
it difficult to directly contrast samples analyzed with different
protocols. In particular, in a few hard-to-interpret regions or
genes, different analytical protocols (or variations on the
same protocol) can result in markedly different lists of
variants. A welcome advance in this area is the development
of standards for storing sequence data (69) and resulting
variant calls (77), which make it easy for tools developed in
different groups to interoperate.



ASSOCIATION ANALYSIS 

The final step before association analysis is annotation of func- 
tional effects for each variant. There are now reliable, widely 
used tools for this purpose (78-81). According to their impact
on protein-coding transcripts, these tools classify single
nucleotide variants as synonymous, missense, nonsense,
splice site [typically defined as within 2 bp
of an intron-exon boundary, as supported by empirical
analyses (82)] or read-through alleles; indels are typically
annotated according to whether they result in a frameshift.
Typically, the tools also assign each variant a score,
based on analysis of protein structure or evolutionary conser- 
vation, to separate variants with little functional impact from 
those more likely to damage protein function (83,84). A strat- 
egy must be selected for dealing with variants that have mul- 
tiple annotations — for example, a variant might alter the 
protein-coding sequence for one transcript but not for other 
overlapping transcripts. These annotation conflicts can be 
resolved by focusing only on canonical transcripts for each 
gene (for example, RefSeqGene), by focusing on the longest 
transcript in each gene, or by using the most deleterious pre- 
diction from all available transcripts. 
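
The three strategies for resolving conflicting annotations can be sketched in a few lines of Python. The severity ranking, transcript identifiers and transcript lengths below are assumptions made for illustration; they are not taken from any particular annotation tool, and the appropriate ranking is itself a design choice.

```python
# Illustrative severity ranks (an assumption for this sketch, not a standard).
SEVERITY = {
    "non_coding": 0,
    "synonymous": 1,
    "missense": 2,
    "read_through": 3,
    "splice_site": 4,
    "frameshift": 5,
    "nonsense": 6,
}


def most_deleterious(annotations):
    """Pick the (transcript_id, consequence) pair with the highest severity."""
    return max(annotations, key=lambda a: SEVERITY[a[1]])


def canonical_only(annotations, canonical_id):
    """Return the annotation on the canonical transcript, or None if absent."""
    for transcript_id, consequence in annotations:
        if transcript_id == canonical_id:
            return transcript_id, consequence
    return None


def longest_transcript(annotations, transcript_lengths):
    """Return the annotation on the longest annotated transcript."""
    return max(annotations, key=lambda a: transcript_lengths[a[0]])


# Hypothetical variant annotated against three overlapping transcripts.
annotations = [("tx1", "synonymous"), ("tx2", "missense"), ("tx3", "non_coding")]
lengths = {"tx1": 2100, "tx2": 1500, "tx3": 900}

print(most_deleterious(annotations))
print(canonical_only(annotations, canonical_id="tx1"))
print(longest_transcript(annotations, lengths))
```

In this toy example, the same variant is treated as missense under the first strategy and as synonymous under the other two, so the choice directly changes which variants enter a gene-based test.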

We recommend that every analysis of exome sequence data 
should start with single variant association tests. While these 
tests are typically not well powered for rare variants (most of 
which will be seen only once or twice, even in very large data- 
sets), they provide a convenient opportunity to quality check 
the data — by verifying that previously reported common 
variant signals are reproduced and by inspecting genome-wide 
QQ plots to ensure samples are adequately matched and 
results are not unduly influenced by population structure (85). 
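
As a minimal sketch of the QQ-plot check, the Python code below computes the expected and observed -log10 P-value pairs and a genomic inflation factor from a list of single variant P-values. It assumes two-sided tests with one degree of freedom and P-values strictly greater than zero; the function names are ours, and plotting is left to whatever graphics library is at hand.

```python
from math import log10
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def qq_points(pvalues):
    """(expected, observed) -log10 P pairs under a uniform null."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return [(-log10((i + 0.5) / n), -log10(p)) for i, p in enumerate(ordered)]


def genomic_inflation(pvalues):
    """Median observed 1-df chi-squared statistic divided by its null median."""
    nd = NormalDist()
    # Convert each two-sided P-value (0 < p <= 1) back to a chi-squared value.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical, perfectly uniform P-values: the null expectation.
null_pvalues = [(i + 0.5) / 1000 for i in range(1000)]

points = qq_points(null_pvalues)
lam = genomic_inflation(null_pvalues)
print(f"lambda = {lam:.3f}")
```

An inflation factor close to 1 and points that follow the diagonal until the extreme tail suggest that cases and controls are adequately matched; a value well above 1 points to residual population structure or technical artifacts.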

Because most variants are individually rare, achieving ad- 
equate statistical power requires a design where additional 
copies of the variant of interest can be sampled (perhaps in 
a family study or in a founder population) or the ability to 
combine and evaluate groups of variants likely to have 
similar function (86). The basic idea behind most rare 
variant association tests is to group variants likely to have 
an impact on the function of a specific gene and to compare 
the distribution of these variant groupings to the distribution 
of the trait of interest. 

There are two major categories of association tests for 
groups of rare variants. In one type of test, the total number 
of rare alleles across a gene is tabulated in each individual 
and these totals are compared between cases and controls, 
for a discrete trait, or correlated with trait values, for a quan- 
titative trait (32). These tests can be carried out by assigning 
all variants the same weight or they can be designed to
place more weight on rarer variants and other variants that
are expected to have more severe functional consequences 
(87,88). While early versions of these tests require explicit 
allele frequency cut-offs for defining rare variants, newer ver- 
sions use adaptive thresholds whose choice is guided by avail- 
able data (89). 
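
The first category of tests can be illustrated with a minimal Python sketch: rare allele counts are summed per individual (optionally with weights) and the mean burden is compared between cases and controls, with significance assessed by permuting case-control labels. This is a generic illustration, not the specific method of any cited reference; the genotype matrix, the equal default weights and the function names are assumptions.

```python
import random


def burden_scores(genotypes, weights=None):
    """genotypes[i][j]: copies (0, 1 or 2) of the rare allele at variant j."""
    if weights is None:
        weights = [1.0] * len(genotypes[0])  # equal weights by default
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def burden_statistic(scores, is_case):
    """Mean burden in cases minus mean burden in controls."""
    case = [s for s, c in zip(scores, is_case) if c]
    ctrl = [s for s, c in zip(scores, is_case) if not c]
    return sum(case) / len(case) - sum(ctrl) / len(ctrl)


def permutation_pvalue(scores, is_case, n_perm=10000, seed=1):
    """Two-sided P-value from shuffling case-control labels."""
    rng = random.Random(seed)
    observed = abs(burden_statistic(scores, is_case))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(burden_statistic(scores, labels)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene with three rare variants in six cases and six controls.
genotypes = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0],  # cases
    [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1],  # controls
]
is_case = [True] * 6 + [False] * 6

scores = burden_scores(genotypes)
stat = burden_statistic(scores, is_case)
pval = permutation_pvalue(scores, is_case)
print(stat, pval)
```

Passing a `weights` list that gives larger values to rarer or more damaging variants turns the same code into a weighted burden test; how those weights are chosen is left open here.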

Another type of test allows for the situation where a gene 
might harbor both deleterious and protective variants. 
Instead of comparing the total number of variants per individ- 
ual, these tests examine whether the number of variants with 
non-zero effect sizes (whether positive or negative) exceeds 
chance expectations (33,89,90). In general, we recommend 
that at least one test from each category (that is, one burden 
test assuming all alleles impact the trait in the same direction 
and one burden test allowing for alleles with opposite direc- 
tions of effect in each gene) should be considered and that 
variable threshold implementations of these tests should be 
used. When it is not practical to use variable threshold
methods, we recommend that a variety of frequency
cut-offs should be considered (for example, 0.05, 0.01 and
0.001). An additional analysis, focused on individuals who
are homozygous or compound heterozygous for deleterious
variants in a gene, might eventually become a useful
complement to these tests because it focuses explicitly on
individuals in whom gene function might be ablated.
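
When several frequency cut-offs are considered, the multiplicity can be handled by taking the best statistic across cut-offs and permuting phenotypes to obtain its null distribution. The Python sketch below illustrates this idea with a simple same-direction burden score and an absolute correlation statistic; it is written in the spirit of variable threshold approaches but is not an implementation of any published test, and it does not cover tests that allow opposite directions of effect. The thresholds, toy data and function names are assumptions.

```python
import random
from statistics import mean


def allele_frequencies(genotypes):
    """Sample allele frequency of each variant (diploid genotypes)."""
    n = len(genotypes)
    return [sum(row[j] for row in genotypes) / (2 * n)
            for j in range(len(genotypes[0]))]


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy) / (sxx * syy) ** 0.5


def variable_threshold_test(genotypes, phenotype, thresholds,
                            n_perm=2000, seed=1):
    """Best burden statistic across thresholds and its permutation P-value."""
    freqs = allele_frequencies(genotypes)
    score_sets = []
    for t in thresholds:
        keep = [j for j, f in enumerate(freqs) if 0 < f <= t]
        score_sets.append([sum(row[j] for j in keep) for row in genotypes])

    def best(pheno):
        return max(abs_correlation(s, pheno) for s in score_sets)

    observed = best(phenotype)
    rng = random.Random(seed)
    permuted = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if best(permuted) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 10 cases (indices 0-9), 10 controls, four variants.
carriers = {0: [0, 1, 10, 11], 1: [2, 3], 2: [4], 3: [5]}
genotypes = [[1 if i in carriers[j] else 0 for j in range(4)] for i in range(20)]
phenotype = [1] * 10 + [0] * 10

observed, pval = variable_threshold_test(
    genotypes, phenotype, thresholds=(0.1, 0.05, 0.03))
print(observed, pval)
```

Because the same maximum is recomputed in every permutation, the resulting P-value already accounts for having tried several thresholds, so no further correction across thresholds is needed for that gene. The thresholds used here are scaled to the tiny toy sample and differ from the cut-offs suggested in the text.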

A number of packages under active development now
implement a variety of these tests (89-91,
http://genome.sph.umich.edu/wiki/EPACTS,
http://atgu.mgh.harvard.edu/plinkseq). In
addition to implementing multiple tests, these packages make
it simple to consider different subsets of the data for analysis.
For example, an initial analysis might include all missense,
splice or stop altering variants, excluding only synonymous
and non-coding variants. Since many missense variants will
not significantly impact protein function (92,93), a second
analysis might focus on the subset of these variants that are
predicted to have deleterious consequences. An even more
restricted analysis might focus only on splice, frameshift and
stop-altering variants among this latter set (94).

We expect there will be no optimal statistical test, filtering 
strategy or frequency cut-off for gene-based tests. The spec- 
trum of functional variants and their characteristics will 
likely differ between genes, depending on the importance of 
the gene's function for the organism's overall function and
the luck of the evolutionary draw. Given the multiplicity of 
statistical tests (and of filtering strategies used to decide 
which variants are proposed as input for these tests), 
permutation-based approaches should be used for evaluating 
statistical significance. Permutations can naturally account 
for the fact that some genes have very few rare alleles (and 
thus can never produce a significant burden test result) and 
that multiple correlated tests might have been undertaken 
(31). In the absence of permutation-based significance
thresholds, a good rule of thumb is that burden test results from
exome-sequencing studies should reach P-values on the
order of 5 × 10⁻⁷ or less before being declared significant
(this stringent threshold accounts for the number of genes
tested but also for the variety of tests that must be considered
and the choice of variants to test inherent in the analysis of
these studies). Just as with single variant tests, we recommend
generating QQ plots to summarize association results across
the genome and ensure test statistics are well behaved. We
note that it is valid to combine results for all the tests
considered (single variant, burden tests using different frequency
thresholds and/or aggregation strategies, etc.) into a single
QQ plot.
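
One way to see how a threshold of this magnitude can arise is a simple Bonferroni calculation. The numbers in the Python sketch below (about 20 000 genes and five test or filtering configurations per gene) are assumptions chosen for illustration; they are not the derivation used for the rule of thumb above, and permutation-based thresholds remain preferable because they account for correlation between tests.

```python
def bonferroni_threshold(alpha, n_genes, n_tests_per_gene):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / (n_genes * n_tests_per_gene)


# Illustrative assumptions: ~20 000 genes, 5 test/filter combinations each.
threshold = bonferroni_threshold(alpha=0.05, n_genes=20_000, n_tests_per_gene=5)
print(f"{threshold:.1e}")
```

Under these assumptions the threshold is 0.05 / 100 000, which is on the order of 5 × 10⁻⁷; considering more tests or variant filters per gene would make it more stringent still.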



APPROACHES FOR FOLLOW-UP OF PROMISING 
SIGNALS 

In some rare cases, exome sequencing of a single large sample 
will be sufficient to demonstrate association (perhaps after 
technical validation of key genotypes, to show that they are 
not genotyping artifacts). More often, it will be necessary to 
examine the most promising variants in additional samples 
(95). A range of approaches are available for follow-up, 
ranging from in silico approaches (based on genotype imput- 
ation) to targeted genotyping or targeted sequencing. 

SNPs with frequencies >1% can usually be tested in
thousands of samples by direct genotyping or imputation since
these SNPs are frequent enough to be tested individually. A
recent Crohn's disease-sequencing study illustrates the
possibilities (96): after analysis of sequence data for 350 cases
and 350 controls, 70 variants were examined in >16 000
additional cases of Crohn's disease, >12 000 cases of ulcerative
colitis and 17 000 controls, resulting in a clear association for a
splice variant in CARD9 (allele frequency = 0.2-0.7%, odds
ratio = 0.29, P < 1 × 10⁻¹⁶). An important extension of this
approach is the set of studies that attempt to examine
essentially all, or most, of the variants discovered in a sequencing
experiment in very large numbers of additional samples. One
notable set of these experiments, currently underway, is the
exome chip experiments. These experiments use arrays designed to
include >250 000 non-synonymous variants identified by
sequencing >12 000 individuals, and the arrays are being used to
genotype >1 000 000 individuals to explore genetic contributions
to a great variety of traits. A limitation of exome chips is that
they will miss a significant fraction (~15-20%) of variants
whose genomic context is incompatible with array-based
genotyping, variants highly specific to non-European
populations (~10 000 of the 12 000 sequenced individuals
considered for the design of the exome chip were of European
ancestry), as well as the rarest variants in any population. Still,
because of their focus on very rare coding variation (the vast
majority of variants on the exome chip have frequency
<0.5%), the analyses of exome chip experiments will be
more similar to the analysis of exome-sequencing studies
than to the analysis of genome-wide association studies,
requiring, for example, careful attention to ancestry matching
and the consideration of tests that consider many coding
variants in a gene. While these exome chip studies will only
provide an imperfect approximation to the results of
sequencing studies, we hope they will provide a preview of the
discoveries that will be possible when exome sequencing is
applied to hundreds of thousands of samples.

When a very large number of individuals with exome se- 
quence data and whole genome genotypes is available, statis- 
tical imputation can also provide an effective strategy for 
extending sample sizes (97,98). The approach can be relatively 
fast and economical. Currently, sufficiently large reference
panels that can support imputation of very rare variants are
not available for most cosmopolitan populations. However,
several examples of the success of this approach exist, many
from the isolated population of Iceland. There, relatively
limited genetic diversity, a panel of sequenced Icelanders
and the availability of tens of thousands of genotyped
individuals have enabled recent discoveries using imputation. MYH6
L721W (a variant with allele frequency of 0.4%) was
evaluated in 38 000 individuals and associated with the risk for
sick sinus syndrome (odds ratio = 12.5, P = 2 × 10⁻²⁹) (99),
and APP A673T (allele frequency 0.1%) was
evaluated in 71 000 individuals and associated with the risk for
Alzheimer's disease (odds ratio = 5.29, P = 5 × 10⁻²⁷)
(100).

When targeted genotyping and imputation are not possible 
or when the association signal is driven by a burden of very 
rare mutations (101), it will be necessary to undertake targeted 
sequencing of genes prioritized on the basis of initial analyses. 
While current methods for sequencing 50-200 genes in
tens of thousands of samples are cumbersome, this is an area of
active technology development where we expect important 
advances will soon be available. These advances should 
perform at a fraction of the cost of traditional Sanger sequen- 
cing and will allow follow-up of exome-sequencing studies to 
explore promising signals due to a burden of rare variants. 



THE ROLE OF FUNCTIONAL ASSAYS IN 
INTERPRETING EXOME-SEQUENCING STUDIES 

Genetic analyses that consider groups of rare variants will 
improve in power if functional variants can be separated 
from those that have no impact on function so that association 
tests and follow-up experiments can focus on the functional 
variants. In this context, functional or computational assays 
that identify variants most likely to impact gene function — 
particularly when they can be carried out on a genomic 
scale — could play a very important role in the successful inter- 
pretation of exome-sequencing studies. As these functional 
assays are expanded to the rest of the genome, they will 
likely play a critical role in expanding studies of rare variation 
beyond the exome and to the rest of the genome — where iden- 
tifying, aggregating and grouping functional variants remain 
much harder. 

Functional characterization of non-synonymous changes 
will also help interpret rare variant association signals and 
help transform genetic findings into precise mechanistic 
insights. Functional studies can reveal the specific molecular
consequences of coding variation on gene products,
as well as the molecular mechanisms by which genes
produce disease (102). However, such functional data, when
used to support statistical signals that cannot stand on their
own, are susceptible to many biases (94). The historical
example of candidate gene association studies is informative:
in that setting, the widespread use of functional
information to support marginal genetic association signals
produced a situation where many published findings were
irreproducible and most such studies are now discounted. In our
view, claims of significance for marginal statistical signals
based on modest functional evidence should be considered
only when generating additional genetic data is impossible.

We encourage human geneticists to carefully plan and con- 
sider the functional experiments that will follow identification 
of robust, rare variant association signals. However, in most 
cases, these experiments should only be undertaken when 
the initial association signal is clearly established. As noted 
above, we do make an exception for high-throughput assays
that attempt to separate variants that are likely to be functional
from those that are likely to be neutral (for example, so as
to focus burden analysis on the most deleterious variants).
These analyses can be productively used in the discovery
process; however, if not used judiciously, they will require
the use of even more stringent thresholds because they
imply an additional round of statistical tests and potential
false discoveries.



FORWARD GENETICS 

Instead of characterizing function in model systems, exome 
sequencing potentially allows for evaluating the functional 
consequence of pathogenic mutations directly in humans. It 
is now possible to envision an era of 'forward genetics' in- 
volving humans. The concept involves understanding gene 
function by identifying patients harboring specific mutations 
and characterizing the physiologic and clinical consequences 
of these mutations. Direct study of rare, human 'knock-out' 
variants may be particularly illustrative (103-105). For 
example, humans heterozygous or homozygous for knockout 
alleles at several plasma lipid genes have been identified and 
detailed study of these individuals has led to new biologic 
insights. 

WHERE DO WE GO FROM HERE? 

Exome sequencing has already been successful at identifying 
the genetic cause of many Mendelian disorders. While appli- 
cations of exome sequencing to common, complex diseases 
will be more challenging, we expect that the continued avail- 
ability of high-quality phenotyped samples, combined with 
advances in sequencing technology and analytical methods, 
will soon allow tens of thousands of individuals to be examined for
many common outcomes and quantitative traits. As large 
numbers of sequenced individuals become available, a particu- 
lar challenge will be the development of appropriate methods 
for combining information (or results) across studies that 
might have used different sequencing platforms or analytical 
approaches for converting sequence data into genotypes. In 
the context of common variant association studies, such 
approaches have been instrumental in the rapid rate of discov- 
ery of the past few years. In the context of rare variant studies, 
we believe that new protocols and statistical methods that 
allow rare variant burden tests to be reconstructed through 
meta-analysis of study specific summary statistics will be ex- 
tremely useful. 

As larger exome-sequencing studies become commonplace,
and the barriers to cross-study analyses are surmounted,
perhaps a harvest of specific biological insights will arrive,
producing a great need for cellular and model organism
systems where these hypotheses can be evaluated. As for
human geneticists, we predict they will then be ready to
continue their systematic exploration of the genome, proceeding
from common variants, to rare coding variants, to a systematic
evaluation of all variation (including rare non-coding
variation) using whole genome-sequencing approaches.